
Enron Email Dataset



Published: August 21, 2009
Topics: Enron, E-mail, Dataset


This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available. The dataset here does not include attachments, and some messages have been deleted "as part of a redaction effort due to requests from affected employees". Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., when the recipient is specified in some parse-able format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.
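The address clean-up described above can be illustrated with a small sketch. The description only says that parse-able names became something of the form user@enron.com and that missing recipients became no_address@enron.com; it does not give the exact mapping. The `first.last` local part, the function name `normalize_recipient`, and the choice to leave unparseable strings unchanged are all assumptions made for illustration, not the rule actually used when the dataset was prepared.

```python
import re

NO_ADDRESS = "no_address@enron.com"

def normalize_recipient(raw: str) -> str:
    """Map a recipient string to an @enron.com address.

    Illustrative only: the first.last convention is an assumption,
    not the documented rule used to prepare the dataset.
    """
    text = raw.strip()
    if not text:
        # No recipient specified at all.
        return NO_ADDRESS
    # Already looks like a valid address: keep it, lower-cased.
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text):
        return text.lower()
    # "Doe, John" -> "John Doe"
    if "," in text:
        last, first = (part.strip() for part in text.split(",", 1))
        text = f"{first} {last}"
    # Drop middle initials such as "K." and any non-alphabetic tokens.
    tokens = [t.strip(".") for t in text.split()]
    tokens = [t for t in tokens if t.isalpha() and len(t) > 1]
    if not tokens:
        # Not parse-able: leave the string untouched (assumption).
        return raw
    if len(tokens) == 1:
        local = tokens[0].lower()
    else:
        local = f"{tokens[0].lower()}.{tokens[-1].lower()}"
    return f"{local}@enron.com"
```

For example, `normalize_recipient("Doe, John")` and `normalize_recipient("Mary K. Smith")` both yield a name-based @enron.com address under this assumed convention, while an empty string maps to no_address@enron.com.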


I get a number of questions about this corpus each week, which I am unable to answer, mostly because they deal with preparation issues and such that I just don't know about. If you ask me a question and I don't answer, please don't feel slighted.

I am distributing this dataset as a resource for researchers who are interested in improving current email tools, or understanding how email is currently used. This data is valuable; to my knowledge it is the only substantial collection of "real" email that is public. Other datasets are not public because of privacy concerns. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).
  • The March 2, 2004 version and the August 21, 2009 version of the dataset are no longer being distributed. If you are using this dataset for your work, you are requested to replace it with the newer version of the dataset below, or make the appropriate changes to your local copy. A total of four messages have been removed since the original version of the dataset.

There are also at least two on-line databases that allow you to search the data, at Enronemail.com and UCB.



Identifier: 2011_04_02_enron_email_dataset
Addeddate: 2013-07-06 08:26:08
Creator: William W. Cohen, MLD, CMU
Mediatype: web
Date: 2009-08-21
Year: 2009
Publicdate: 2013-07-06 08:33:26
Backup_location: ia905703_8

SIMILAR ITEMS (based on metadata)
  • Tucows Software Library, by http://www.email-monitoring.net (software; 47 views, 0 favorites, 0 comments)
  • MusicBrainz Data Dumps (collection; 360 items, 11,736 views)
  • The Dataset Collection (data; 274 views, 0 favorites, 0 comments)
  • Internet Census 2012, by Anonymous (collection; 15 items, 4,276 views)
  • The Dataset Collection, by Weiwei Zhang, Jian Sun, and Xiaoou Tang (data; 18 views, 0 favorites, 0 comments)
  • KXJZ's Insight (audio; 542 views, 1 favorite, 0 comments)
  • Tucows Software Library, by http://www.1cis.com (software; 248 views, 0 favorites, 0 comments)
  • The Dataset Collection (data; 31 views, 0 favorites, 1 comment; 1 review, 4 stars)
  • The Dataset Collection (data; 406 views, 0 favorites, 0 comments)
  • Archive Team (texts; 197 views, 0 favorites, 0 comments)